Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor benchmark workflow #52

Merged
merged 27 commits into from
Oct 28, 2022
Merged

Refactor benchmark workflow #52

merged 27 commits into from
Oct 28, 2022

Conversation

andreab1997
Copy link
Contributor

As @alecandido suggested, I am refactoring the workflow in order to download the files needed for benchmarks directly from the server (https://data.nnpdf.science/)

@codecov-commenter
Copy link

codecov-commenter commented Oct 19, 2022

Codecov Report

Merging #52 (7ac0cbe) into main (4b1d511) will decrease coverage by 42.07%.
The diff coverage is 30.95%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##             main      #52       +/-   ##
===========================================
- Coverage   88.51%   46.44%   -42.08%     
===========================================
  Files          16       17        +1     
  Lines         592      618       +26     
===========================================
- Hits          524      287      -237     
- Misses         68      331      +263     
Flag Coverage Δ
bench ?
unittests 46.44% <30.95%> (-0.36%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/pineko/check.py 83.72% <0.00%> (-9.31%) ⬇️
src/pineko/comparator.py 18.18% <0.00%> (-72.73%) ⬇️
src/pineko/ekompatibility.py 21.42% <21.42%> (ø)
src/pineko/evolve.py 25.45% <25.00%> (-63.44%) ⬇️
src/pineko/theory.py 19.76% <25.00%> (-65.03%) ⬇️
src/pineko/cli/check.py 55.76% <50.00%> (-38.47%) ⬇️
src/pineko/cli/convolute.py 66.66% <50.00%> (-7.25%) ⬇️
src/pineko/theory_card.py 16.00% <0.00%> (-68.00%) ⬇️
... and 7 more

@andreab1997
Copy link
Contributor Author

This should be ready but I believe we need to install wget on the container, right? @alecandido

@felixhekhorn
Copy link
Contributor

as @alecandido said: please separate a script that you call in the workflow, but that one can run even with out (i.e. locally)

@andreab1997
Copy link
Contributor Author

As you can see, the tests after ebf6342 work. This is because I have just used apt-get install wget in the workflow (and of course this is not what we want). However, at least we know that the only thing that we miss is to install wget in the container.

@felixhekhorn felixhekhorn added enhancement New feature or request refactor Refactor code labels Oct 20, 2022
@alecandido
Copy link
Member

However, at least we know that the only thing that we miss is to install wget in the container.

Just to spread the knowledge about how containers are made:

  1. you should go here https://github.com/N3PDF/workflows/blob/main/containers/bench-evol/Containerfile and add wget to the list of packages to be installed, in the second stage
  2. you should follow the instructions here https://github.com/N3PDF/workflows/blob/main/containers/README.md to build and upload the new image

But don't do it, there is already curl installed, so replace wget by curl (curl main purpose is making HTTP requests, so it is not downloading files by default, but you can with a couple of options).

@alecandido
Copy link
Member

In particular, use:

curl -o output.html http://google.com/

https://stackoverflow.com/a/9360357/8653979

@alecandido
Copy link
Member

alecandido commented Oct 20, 2022

Ah no, wait. My bad, you are using recursive download, so you need wget.

Keep going with build and upload, as explained in the instructions above.

Also, move the download_test_files.sh top-level. It is useful also for local testing, no need to hide :)

@andreab1997
Copy link
Contributor Author

Ah no, wait. My bad, you are using recursive download, so you need wget.

Keep going with build and upload, as explained in the instructions above.

Also, move the download_test_files.sh top-level. It is useful also for local testing, no need to hide :)

Yes, I was already struggling to do recursive download with curl :)

@andreab1997
Copy link
Contributor Author

2. https://github.com/N3PDF/workflows/blob/main/containers/README.md

Anyway I believe I need some to provide the github token in some way because if I do podman pull ghcr.io/n3pdf/bench-evol:latest I get the error Error: initializing source docker://ghcr.io/n3pdf/bench-evol:latest: Requesting bearer token: invalid status code from registry 403 (Forbidden)

@alecandido
Copy link
Member

alecandido commented Oct 20, 2022

# login to GitHub registry with user credentials (not organization), see [3]
echo ${PAT} | podman login ghcr.io -u <username> --password-stdin

[...]
[3]: official GitHub registry docs

As specified in the linked docs, you should give a PAT of yours the write:packages scope.

P.S.: it is not the $GITHUB_TOKEN of the workflow, that's a different thing, it is a PAT (Personal Access Token)

@andreab1997
Copy link
Contributor Author

# login to GitHub registry with user credentials (not organization), see [3]
echo ${PAT} | podman login ghcr.io -u <username> --password-stdin

[...]
[3]: official GitHub registry docs

As specified in the linked docs, you should give a PAT of yours the write:packages scope.

P.S.: it is not the $GITHUB_TOKEN of the workflow, that's a different thing, it is a PAT (Personal Access Token)

Ok thanks, now it works. However, maybe the install.sh script is broken for lhapdf. I have been trying for a while but it still fails. The error seems to be g++: fatal error: Killed signal terminated program cc1plus compilation terminated. make[2]: *** [Makefile:557: liblhapdf_yaml_cpp_la-emitterutils.lo] Error 1
Any clues?

@alecandido
Copy link
Member

No, wait, I'm stupid. I pointed you to the bench-evol container, but here you are using the lhapdf one...

@andreab1997
Copy link
Contributor Author

No, wait, I'm stupid. I pointed you to the bench-evol container, but here you are using the lhapdf one...

Here in bench.yml I am using bench-evol that is why I am trying to build it. Should I use something else? Anyway I need lhapdf for the benchmarks

@alecandido
Copy link
Member

If you just need lhapdf, and not APFEL or Pegasus or QCDNUM, then use the more restricted lhapdf container (same repo).

However, I tried to run install.sh and I had no issue. Try yourself to do it in a fresh environment (create with virtualenv env and just activate), and (if still not working) try on a Linux machine.

@andreab1997
Copy link
Contributor Author

If you just need lhapdf, and not APFEL or Pegasus or QCDNUM, then use the more restricted lhapdf container (same repo).

However, I tried to run install.sh and I had no issue. Try yourself to do it in a fresh environment (create with virtualenv env and just activate), and (if still not working) try on a Linux machine.

But actually the install script fails on the container (so after having done ./build.bash bench-evol) which is a Linux machine, right?

@alecandido
Copy link
Member

g++: fatal error: Killed signal terminated program cc1plus compilation terminated. make[2]: *** [Makefile:557: liblhapdf_yaml_cpp_la-emitterutils.lo] Error 1 Any clues?

Killed signal terminated program is telling you that something killed it, like when you press Ctrl-C...
Compiled on dom, and it is just working.

@alecandido
Copy link
Member

alecandido commented Oct 20, 2022

Ok, I actually have the new image on dom, but I can't upload it, because of the following error:

Error: writing blob: initiating layer upload to /v2/n3pdf/bench-evol/blobs/uploads/ in ghcr.io: denied: permission_denied: write_package

Unfortunately, GitHub just introduced fine-grained PAT, and my classic token, with write:packages permission, it is not enough to publish a package to the @N3PDF organization.
Moreover, I can't issue a fine-grained token with access to N3PDF/workflows, because it is not allowed by @N3PDF organization.

Both (classic and fine-grained) can be allowed to perform actions on @N3PDF, but @scarrazza or @scarlehoff has to do it explicitly at:
https://github.com/organizations/N3PDF/settings/personal-access-tokens-onboarding

(I'm not an admin of @N3PDF, so I can't do it myself)

@alecandido
Copy link
Member

Ok, now I've done everything I could: we also have a releasing workflow, such that packages can be built in the CI and released automagically.
https://github.com/N3PDF/workflows/blob/v2/.github/workflows/publish-containers.yml

But it is still failing:
https://github.com/N3PDF/workflows/actions/runs/3302837504
with the usual error:

Error: writing blob: initiating layer upload to /v2/n3pdf/lhapdf/blobs/uploads/ in ghcr.io: denied: permission_denied: write_package

So, not even the ${{ github.token }} (or ${GITHUB_TOKEN}, is just the same) is able to release, here the default permissions:
https://docs.github.com/en/actions/security-guides/automatic-token-authentication#permissions-for-the-github_token

Therefore, it is really up to @N3PDF admins, since the Default access (permissive) has the write:packages scope, this means it has been somehow set to Default access (restricted).

Here the guide on how to control permissions for the ${GITHUB_TOKEN}.
https://docs.github.com/en/actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token

The moment it is fixed, we can get the new container just by adding a tag, or rerunning the last workflow.

Now, I can't do anything more for this PR, it is totally up to @scarrazza and @scarlehoff

@alecandido
Copy link
Member

Actually, I can't experiment not being an admin, so I don't know what is on the other side, but maybe is also a GitHub problem.

In particular, if you read the output of the first section, "Set up job", there is a dedicated item to "GITHUB_TOKEN Permissions", and this is the output:


GITHUB_TOKEN Permissions
  Actions: write
  Checks: write
  Contents: write
  Deployments: write
  Discussions: write
  Issues: write
  Metadata: read
  Packages: write
  Pages: write
  PullRequests: write
  RepositoryProjects: write
  SecurityEvents: write
  Statuses: write

But consider that 8 hours ago these people (the one providing the push action I'm using) were able to release with an extremely similar workflow:
https://github.com/redhat-actions/push-to-registry/actions/runs/3301376932/jobs/5446774905

@alecandido
Copy link
Member

Latest update: everything still holds for new packages, I have no idea on how to publish them if the tokens are not accepted.

However, for existing packages of which I'm the owner I could solve it. So, not bench_evol, since @niclaurenti is the owner, but I can instruct him on how to do it.

But here the dependency is only lhapdf, and for that there is an lhapdf container I could publish (since I'm the owner of that). bench_evol has been tailored for EKO benchmarks, providing APFEL, QCDNUM, Pegasus, and so on.

So, the solution was to switch benchmark.

Now it is still failing because of the dependency on lz4, since it requires to build the wheel and gcc is not available in the container. But this is in the range of things that have a straight solution :)
(now it is more a problem that there are more than one, so I will try to pick the best, but at some point any is fine to begin with)

@alecandido
Copy link
Member

alecandido commented Oct 23, 2022

Problem with lz4 solved on EKO side, now it is fine.

Current problem is that I had to release the version with Composite Output struct, so the old output.Output is not any longer there.
Two solutions:

  1. we revert the version of EKO, and we merge with the failing workflow (then open a PR to update the dependency and the EKO API usage)
  2. we update immediately here

I believe it should not require a major rework to use the new structure, it's just a matter of updating the usage here and there. I would go for 2.

P.S.: now it is not a workflow/containter problem any longer, even if there is something fishy going on from the container side (but EKO isobench are passing...)

@felixhekhorn
Copy link
Contributor

Current problem is that I had to release the version with Composite Output struct, so the old output.Output is not any longer there.

for that exact reason: can you please drop the new releases on the 0.10 branch and commit instead a 0.11 version? Otherwise we run into lot's of trouble all over the place, because poetry thinks they are compatible, which they are not

@alecandido
Copy link
Member

I actually thought the were compatible, I didn't remember I dropped the struct. However, you can keep going and reshuffle the tags :)

@felixhekhorn
Copy link
Contributor

I think a tag can not be renamed, can it? (that wouldn't make sense to me ...) and, I guess, the problem is PyPI (since poetry will look there), right?

@alecandido
Copy link
Member

With "reshuffle tags" I mean:

  • delete the tag from the repo (actually, from v0.10.3 to v0.10.5 included)
  • yank the release from PyPI (there is only 0.10.5, the other workflows failed)
  • release v0.11.0, and it will automatically generate the new version on PyPI

@felixhekhorn
Copy link
Contributor

ok, I believe to have done ...

@andreab1997
Copy link
Contributor Author

So, what is missing here? I see that the workflows are failing for the reason you mentioned, are you taking care of this or should I do something?

@alecandido
Copy link
Member

Just rebased on main. There was a conflict just with the renaming q2_grid -> muf2_grid and similar.

@andreab1997
Copy link
Contributor Author

I am reviewing it now :)

@andreab1997
Copy link
Contributor Author

Everything seems fine now (of course to be also tested on an actual FK computation). Anyway tests are passing and I don't see anything wrong. So I am asking your review again I then I believe we can merge. Thanks for this! @alecandido @felixhekhorn

@andreab1997
Copy link
Contributor Author

@felixhekhorn I have added you at least you know what it is happened :)

@alecandido alecandido merged commit 85ca767 into main Oct 28, 2022
@alecandido alecandido deleted the refactor_workflow branch October 28, 2022 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request refactor Refactor code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants